bedricks@ohsu.edu
  hillali@ohsu.edu
  OHSU Gaines Hall 19 & 21 (we're neighbors)

Preliminaries

For this workshop, Steven and I started with a whiteboard session where we came up with a bunch of ideas for exploring word use in these presidential debate transcripts. After much discussion, coffee, and lots of "it would be so cool if we could..." ideas, we settled on these questions to start our exploratory data analysis (EDA):

  1. Which words were the most popular at the debates?
    • We'll calculate frequency of each word.
  2. Who speaks the most?
    • We'll calculate share of words spoken by each candidate at each debate.
  3. Do some candidates use more complex words than others?
    • We'll calculate word length and number of syllables per word to quantify "word complexity".
  4. Do some candidates have a more diverse vocabulary than others?
    • We'll use candidate type-token ratios as an index of "lexical diversity": TTRs reflect the ratio of unique words to total words spoken; higher ratios suggesting a more diverse vocabulary and lower ratios suggesting a less diverse vocabulary.
  5. Which words are most characteristic of each candidate?
    • We'll use loglikelihood to measure this.
  6. Who is talking about whom?

Install and load packages

We'll use a number of R packages for this workshop (remember: install once per machine, load once per work session). You'll need all the following packages installed first. Here is a function for doing this in one fell swoop:

ipak <- function(pkg) {
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}
pkgs <- c("readr", "dplyr", "tidyr", "stringr", "lubridate", "ggplot2", "wordcloud", 
    "devtools", "RColorBrewer", "dygraphs", "xts")  # list packages needed
ipak(pkgs)  # take function, and give it that list

Now we'll need to load 2 packages that are available only on Github, so we'll use the devtools package to run the function install_github to install then load those two packages. Note that the devtools:: command is not needed here because we already loaded that package above. But this command is helpful if you just want to use a single function from a package like we are doing here.

devtools::install_github("jbkunst/d3wordcloud")
library(d3wordcloud)
devtools::install_github("dill/beyonce")
library(beyonce)

Read in the transcripts

We'll use the readr package to read in our master_counts.csv file.

master_counts <- read_csv("./data/master_counts.csv", col_names = FALSE)

Now we have created an R object called master_counts, and it contains all the data stored in the csv file we made in Python. Let's find out what kind of object we just made- fingers are crossed that it is a dataframe, which makes life easier in R...

str(master_counts)  # str here stands for structure
Classes 'tbl_df', 'tbl' and 'data.frame':   63763 obs. of  6 variables:
 $ X1: chr  "04 Feb 2016" "04 Feb 2016" "04 Feb 2016" "04 Feb 2016" ...
 $ X2: chr  "democrats" "democrats" "democrats" "democrats" ...
 $ X3: chr  "moderator" "candidate" "candidate" "candidate" ...
 $ X4: chr  "TODD" "SANDERS" "CLINTON" "SANDERS" ...
 $ X5: chr  "campaign" "results" "endorse" "california" ...
 $ X6: int  2 1 1 1 1 1 3 1 2 2 ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 6
  .. ..$ X1: list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ X2: list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ X3: list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ X4: list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ X5: list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ X6: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"

Neat, data.frame is one of the classes, and you can see that we have 6 variables. Since we have a dataframe, we can re-name our 6 variables with more sensible names than X1, X2, etc. by using names().

names(master_counts) <- c("date", "party", "speaker_type", "speaker", "word", 
    "counts")

Did that work? We'll use dplyr::glimpse instead of str to do a quick sanity-check; compare the outputs for yourself.

glimpse(master_counts)  # dplyr verb for dataframes
Observations: 63,763
Variables: 6
$ date         <chr> "04 Feb 2016", "04 Feb 2016", "04 Feb 2016", "04 ...
$ party        <chr> "democrats", "democrats", "democrats", "democrats...
$ speaker_type <chr> "moderator", "candidate", "candidate", "candidate...
$ speaker      <chr> "TODD", "SANDERS", "CLINTON", "SANDERS", "SANDERS...
$ word         <chr> "campaign", "results", "endorse", "california", "...
$ counts       <int> 2, 1, 1, 1, 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1...

Data wrangling

Remove non-candidates

Looking at the above output, you can see that one of our variables is named speaker_type, which looks to be one of two character strings: moderator or candidate. Let's look more closely at this.

master_counts %>%
  select(speaker_type) %>% # select just this 1 variable
  distinct() # yup, just one of 2 things
# A tibble: 2 × 1
  speaker_type
         <chr>
1    moderator
2    candidate
master_counts %>%
  filter(speaker_type == "candidate") %>% # cands only
  distinct(speaker) %>% # unique names only
  arrange(speaker) # put list in alphabetical order
# A tibble: 31 × 1
    speaker
      <chr>
1  AUDIENCE
2  BROWNLEE
3      BUSH
4    CARSON
5    CHAFEE
6  CHRISTIE
7   CLINTON
8    CRAMER
9      CRUZ
10 EPPERSON
# ... with 21 more rows

So we confirmed that we have moderators and candidates as speakers, but woops! We see we have an 'AUDIENCE' candidate in the democratic party (who booed x2), and a lot of other unusual sounding speakers who are classified as candidates. Lesson of the day: always perform sanity checks with #otherpeoplesdata. Here is our plan:

  1. Make a list of all the speakers that are actual candidates, then filter only those candidates into my new dataframe, which we'll call cand_counts.
  2. We also discovered a bug on UCSB site for downloading transcripts from the Aug GOP debate, which cuts off about halfway through, so we will also exclude that debate.
  3. See where EPPERSON used the word "k"? We might want to look back at the Python script for parsing the transcripts and figure out what happened there, but for now we'll exclude 1-letter words. But we need a way to measure the length of each word, so we know when length = 1. We'll use the stringr package to calculate str_length, then use dplyr::filter to exclude rows where length = 1.
real_cands <- c("CLINTON", "SANDERS", "CHAFEE", "WEBB", "OMALLEY", "BUSH", "CARSON", 
    "CHRISTIE", "CRUZ", "HUCKABEE", "KASICH", "FIORINA", "RUBIO", "PAUL", "TRUMP", 
    "SANTORUM")
cand_counts <- master_counts %>%
  filter(speaker_type == "candidate" & # candidates only
           (speaker %in% real_cands) & # but real candidates only
           !(date =="06 Aug 2015") & # forget about this debate for now
           !(str_length(word) <= 1)) # exclude short words

If you look in your global environment now, you can see that my cand_counts dataframe still has 6 variables, just like my master_counts dataframe, but now I only have 43213 as opposed to 63763 rows or observations in my dataframe.

Remove stopwords

We'll use the quanteda package's English stopwords list to remove words that we likely don't care about.

quanteda::stopwords("english")  # see list of stopwords
`?`(stopwords  # read note of caution
)

As it states in the quanteda note of caution, stops words are an arbitrary choice and your mileage may vary here. Now we'll use our second verb for dataframes, dplyr::filter, to keep only those words stored in the word variable that are not in this list of stopwords. In R, ! is the logical operator that means "NOT."

cand_counts <- cand_counts %>% filter(!(word %in% stopwords("english")))  # keeps only rows with words not in list

Dealing with dates

The date of each debate is currently a character variable- we need to change it to a date! We'll use lubridate to make R recognize that a string like 04 Feb 2016 has specific information stored in it (as opposed to fruit salad, for example). Namely, the string 04 Feb 2016 tells me 3 important variables: day, month, and year, and in that order. Recognizing that information will allow us to sort by so that the debates are shown in chronological order when we start plotting. You could also calculate number of days/months/years between two debates. We'll let R know that we are inputting a string that is formatted day-month-year, so lubridate::dmy() is the right command for this job.

library(lubridate)
dmy("04 Feb 2016")  # test this out- does it work?
cand_counts <- cand_counts %>% mutate(date_time = dmy(date))
glimpse(cand_counts)  # what changed?
Observations: 41,609
Variables: 7
$ date         <chr> "04 Feb 2016", "04 Feb 2016", "04 Feb 2016", "04 ...
$ party        <chr> "democrats", "democrats", "democrats", "democrats...
$ speaker_type <chr> "candidate", "candidate", "candidate", "candidate...
$ speaker      <chr> "SANDERS", "CLINTON", "SANDERS", "SANDERS", "CLIN...
$ word         <chr> "results", "endorse", "california", "hat", "naval...
$ counts       <int> 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 8, 2, 1, 1, 1...
$ date_time    <date> 2016-02-04, 2016-02-04, 2016-02-04, 2016-02-04, ...

Adding word-level data

Now, let's calculate some new variables to describe the individual words spoken by the candidates: the length and number of syllables for each word. We'll use the stringr package, which gives us verbs for dealing with variables that contain character strings. We'll use it to strip the punctuation from each word (so "i'll" becomes "ill") then count the number of letters. Play around with these commands to get a feel for how the functions work (bonus points if you can identify who said the fruit salad line in the debates)!

line <- "The fruit salad of their life is what I will look at"
str_length(line)  # from stringr, number of characters (including spaces)
syllables(line)  # from quanteda, number of syllables
str_length("fruit")
syllables("fruit")

We'll add these new variables using dplyr::mutate to our cand_counts dataframe.

cand_counts <- cand_counts %>%
  mutate(word_nop = str_replace_all(word, "[[:punct:]]",""), # strip punctuation
         word_length = str_length(word_nop), # stringr package
         syl_count = syllables(word)) %>% # quanteda package
  select(-word_nop) # drop that, we won't use it again
head(cand_counts)
# A tibble: 6 × 9
         date     party speaker_type speaker       word counts  date_time
        <chr>     <chr>        <chr>   <chr>      <chr>  <int>     <date>
1 04 Feb 2016 democrats    candidate SANDERS    results      1 2016-02-04
2 04 Feb 2016 democrats    candidate CLINTON    endorse      1 2016-02-04
3 04 Feb 2016 democrats    candidate SANDERS california      1 2016-02-04
4 04 Feb 2016 democrats    candidate SANDERS        hat      1 2016-02-04
5 04 Feb 2016 democrats    candidate CLINTON      naval      1 2016-02-04
6 04 Feb 2016 democrats    candidate CLINTON    another      3 2016-02-04
# ... with 2 more variables: word_length <int>, syl_count <int>

Both our new variables, word_length and syl_count should be integers. Is that right? How can you tell?

Units of measurement vs analysis

An important thing to think about when starting out with any new dataset is to ask yourself: what is the unit of measurement? Usually the answer to this question starts with "I have one-row-per-___". Take a close look at our cand_counts dataframe. We have one-row-per-debate/speaker/word combination. For example, my first row currently looks like this:

head(cand_counts, n = 1)
# A tibble: 1 × 9
         date     party speaker_type speaker    word counts  date_time
        <chr>     <chr>        <chr>   <chr>   <chr>  <int>     <date>
1 04 Feb 2016 democrats    candidate SANDERS results      1 2016-02-04
# ... with 2 more variables: word_length <int>, syl_count <int>

We know that certainly there are other rows with that same date, and that there are other rows with that same date where the speaker is also SANDERS. But we know that the count of 1 there tells us that Bernie said the word results during that February 4 debate just once.

On the other hand, given data in this one-row-per-debate/speaker/word dataframe, we can wrangle our data into different units for analysis, as long as those units can be calculated from what was measured. There are several possible units of measurement we can get to from here:

  • One-row-per-debate: this data includes things like the date and order of the debate, the number of candidates present at the debate, total number of words spoken at the debate, etc.
  • One-row-per-candidate/debate. If we want to make any comparisons between candidates, for example, on

Per-debate data

There is some important debate-level information we'll want to use for our EDA. These include:

  1. The number of candidates on stage at each debate.
  2. The total number of words spoken at each debate.
  3. The chronological order the debates happened within each party.

Let's start with (1) and (2). We'll do these two counting tasks at the same time using more dplyr.

debates <- cand_counts %>%
  group_by(date_time, party) %>%
  summarise(tot_speakers = n_distinct(speaker), # count unique speakers
            tot_tokens = sum(counts)) # sum word counts
debates
Source: local data frame [12 x 4]
Groups: date_time [?]

    date_time       party tot_speakers tot_tokens
       <date>       <chr>        <int>      <int>
1  2015-09-16 republicans           10      11382
2  2015-10-13   democrats            4       6405
3  2015-10-28 republicans           10       7220
4  2015-11-10 republicans            8       7218
5  2015-11-14   democrats            2       4197
6  2015-12-15 republicans            9       8705
7  2015-12-19   democrats            2       5419
8  2016-01-14 republicans            7       8341
9  2016-01-17   democrats            2       4100
10 2016-01-28 republicans            7       6212
11 2016-02-04   democrats            2       5939
12 2016-02-06 republicans            7       8536

You can see now that we have one-row-per-debate (which, since the debates for the two different parties happen on different dates, means that we automatically also have one-row-per-debate/party combination). We can do a quick plot here to show the relationship between total words spoken and total speakers per debate. I'll do this by creating a new object, sp (for speaker plot), and add (+) each layer one at a time.

sp <- ggplot(debates, aes(x = date_time, y = tot_speakers, colour = party, group = party))
sp <- sp + geom_point()  # put dots at the x/y values
sp <- sp + geom_line()  # connect the dots
sp <- sp + scale_x_date(name = "")  # x-axis is a time variable
sp <- sp + scale_color_manual(values = c("dodgerblue", "firebrick"))
sp <- sp + scale_y_continuous(name = "number of candidates", breaks = seq(0, 
    10, 2))
sp <- sp + coord_cartesian(ylim = c(0, 10))
sp

This plot is a good sanity check- we know that the total number of speakers has gone down with (almost) each debate in the GOP, but not so much in the democratic ones since there have been 2 candidates for 4 debates now. So now onto (3): we want to make a variable for the order of the debates, separately for each party. We'll use dplyr again to add our new variable, debate_num.

debates <- debates %>%
  ungroup() %>%
  arrange(party, date_time) %>% # dems first, GOP second
  group_by(party) %>% # grouped by party
  mutate(debate_num = seq_along(date_time)) # count unique values in order
debates # did it work?
Source: local data frame [12 x 5]
Groups: party [2]

    date_time       party tot_speakers tot_tokens debate_num
       <date>       <chr>        <int>      <int>      <int>
1  2015-10-13   democrats            4       6405          1
2  2015-11-14   democrats            2       4197          2
3  2015-12-19   democrats            2       5419          3
4  2016-01-17   democrats            2       4100          4
5  2016-02-04   democrats            2       5939          5
6  2015-09-16 republicans           10      11382          1
7  2015-10-28 republicans           10       7220          2
8  2015-11-10 republicans            8       7218          3
9  2015-12-15 republicans            9       8705          4
10 2016-01-14 republicans            7       8341          5
11 2016-01-28 republicans            7       6212          6
12 2016-02-06 republicans            7       8536          7

Next, we need to combine the per-debate data we made (debate_data) with our per-word data (cand_counts). We'll call this new data frame debates.

debates <- debates %>%
  ungroup() %>% # this was grouped before, need to un-do
  left_join(cand_counts) %>%
  mutate(speaker = as.factor(speaker),
         party = as.factor(party),
         debate_num = as.factor(debate_num)) # useful for later plotting

What can we do with this new data? This is actually a little tricky, because remember that we have this separate counts variable, which we can't ignore. We'll need to figure out how to weight each of these variables by its count, which won't make much sense at the word level, but will make for a more accurate estimate of average length and number of syllables for each candidate. So let's stop thinking about per-word data, and start thinking about the data we have for each candidate.

Per-candidate data

Now we are ready for prime time! We need to calculate summary statistics at the candidate level. But because we are interested in looking at the candidates' performances across debates, we'll calculate these summary statistics for each candidate at each debate. Here are 6 variables we need:

  1. Number of tokens: how many words did each candidate speak during each debate?
  2. Number of types: how many unique words did each candidate speak during each debate?
  3. Share of words spoken (share_tokens): the proportion of words spoken during the debate by each candidate. This variable uses a per-debate variable, total words spoken (tot_tokens) as the denominator.
  4. Type-token ratio (ttr): the ratio of unique words (types) divided by total words (tokens) spoken by each candidate
  5. Mean word length (avg_length): mean word length used by each candidate. This variable uses a per-word variable, length of each word (word_length), weighted by counts as the numerator.
  6. Mean syllable count (avg_syl): mean number of syllables per word used by each candidate. This variable uses a per-word variable, syllable count per word (syl_count), weighted by counts as the numerator.
cand_by_debate <- debates %>%
  group_by(party, speaker, debate_num, date_time, tot_tokens, tot_speakers) %>% 
  summarise(tokens = sum(counts), #all words spoken
            types = n_distinct(word), #all unique words spoken
            num_length = sum(word_length*counts), #numerator in mutate later
            num_syl = sum(syl_count*counts)) %>% #numerator in mutate later
  mutate(share_tokens = tokens/tot_tokens, 
         ttr = types/tokens, 
         avg_length = num_length/tokens,
         avg_syl = num_syl/tokens) %>% 
  select(-num_length, -num_syl) %>% # don't need these anymore
  ungroup() # don't need grouping anymore

glimpse(cand_by_debate)
Observations: 70
Variables: 12
$ party        <fctr> democrats, democrats, democrats, democrats, demo...
$ speaker      <fctr> CHAFEE, CLINTON, CLINTON, CLINTON, CLINTON, CLIN...
$ debate_num   <fctr> 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 1, 2, 3, 4, ...
$ date_time    <date> 2015-10-13, 2015-10-13, 2015-11-14, 2015-12-19, ...
$ tot_tokens   <int> 6405, 6405, 4197, 5419, 4100, 5939, 6405, 4197, 5...
$ tot_speakers <int> 4, 4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 4, 10, 10, 8, 9,...
$ tokens       <int> 762, 2305, 2192, 2921, 1952, 3006, 2103, 2005, 24...
$ types        <int> 411, 1018, 1073, 1223, 955, 1204, 881, 866, 1016,...
$ share_tokens <dbl> 0.11896956, 0.35987510, 0.52227782, 0.53902934, 0...
$ ttr          <dbl> 0.5393701, 0.4416486, 0.4895073, 0.4186922, 0.489...
$ avg_length   <dbl> 6.093176, 6.090239, 6.150091, 6.078740, 6.073770,...
$ avg_syl      <dbl> 1.937008, 1.934924, 1.946168, 1.958233, 1.918545,...

Now, length and the number of syllables for each word are represented here by means- but there is another way. We could instead count the number of words spoken by each candidate at each word length.

get_length_counts <- cand_counts %>%
  group_by(party, speaker, word_length) %>%
  summarise(length_count = sum(counts)) %>% # counting words by length
  ungroup() %>% 
  group_by(speaker) %>%
  mutate(tokens = sum(length_count), # counting words across lengths
         by_length_prop = length_count/tokens,
         length_cat = cut(word_length, breaks = c(1, 2, 3, 4, 5, 25),
                          labels = c("2", "3", "4", "5", "6+")),
         max_count = ifelse(by_length_prop == max(by_length_prop), 1, 0))
# do these proportions sum to 1?
get_length_counts %>% group_by(speaker) %>% summarise(sum_check = sum(by_length_prop))

We'll start with a word length plot, which I'll make as an object called wlp.

# stacked bar chart
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
wlp <- get_length_counts %>% ggplot(aes(x = speaker, y = by_length_prop, fill = factor(length_cat)))
wlp <- wlp + geom_bar(stat = "identity", alpha = 0.8)
wlp <- wlp + scale_fill_manual(values = mygop, name = "word length")
wlp <- wlp + scale_y_continuous(expand = c(0, 0))
wlp <- wlp + theme_fivethirtyeight()
wlp <- wlp + theme(axis.text.x = element_text(angle = 45, hjust = 1))
wlp

# color by word length, speaker on x, position dodge
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
gg_length <- ggplot(get_length_counts, aes(x = speaker, y = by_length_prop, 
    fill = factor(length_cat)))
gg_length <- gg_length + geom_bar(stat = "identity", position = "dodge", alpha = 0.8)
gg_length <- gg_length + scale_fill_manual(values = mygop, name = "word length")
gg_length <- gg_length + scale_y_continuous(expand = c(0, 0))
gg_length <- gg_length + theme_fivethirtyeight()
gg_length <- gg_length + theme(axis.text.x = element_text(angle = 45, hjust = 1))
gg_length <- gg_length + theme(panel.grid.major.x = element_blank())
gg_length

# color by word length, speaker on x, position dodge
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
gg_length <- ggplot(get_length_counts, aes(x = length_cat, y = by_length_prop))
gg_length <- gg_length + geom_bar(stat = "identity", alpha = 0.8)
gg_length <- gg_length + facet_grid(speaker ~ .)
gg_length <- gg_length + scale_fill_manual(values = mygop, name = "word length")
gg_length <- gg_length + scale_y_continuous(expand = c(0, 0))
gg_length <- gg_length + theme_fivethirtyeight()
gg_length <- gg_length + theme(panel.grid.major.x = element_blank())
gg_length

# color by word length, speaker on x, facet by length_cat
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
gg_length <- ggplot(get_length_counts, aes(y = by_length_prop, fill = factor(length_cat)))
gg_length <- gg_length + geom_bar(stat = "identity", position = "dodge", alpha = 0.8)
gg_length <- gg_length + facet_grid(length_cat ~ speaker)
gg_length <- gg_length + scale_fill_manual(values = mygop, name = "word length")
gg_length <- gg_length + scale_y_continuous(expand = c(0, 0))
gg_length <- gg_length + theme_fivethirtyeight()
gg_length <- gg_length + theme(panel.grid.major.x = element_blank())
gg_length

Let's try to visualize this data.

lolli_plot <- ggplot(get_length_counts, aes(x = word_length, y = by_length_prop, 
    colour = factor(max_count))) + geom_segment(aes(xend = word_length, y = 0, 
    yend = by_length_prop), alpha = 0.7, lwd = 1) + geom_point(size = 2) + scale_colour_manual(values = c("firebrick", 
    "dodgerblue"), guide = FALSE) + scale_x_discrete(breaks = seq_along(get_length_counts$word_length), 
    name = "word length") + scale_y_continuous(name = "proportion of words") + 
    facet_wrap(~speaker)

lolli_plot %+% subset(get_length_counts, party == "republicans")

Party Animals

Let's start just by comparing the two parties.

ggplot(cand_by_debate, aes(x = party, y = avg_length)) + geom_boxplot()

ggplot(cand_by_debate, aes(x = party, y = avg_syl, group = speaker, fill = speaker)) + 
    geom_boxplot()

ggplot(cand_by_debate, aes(x = speaker, y = avg_syl, fill = speaker)) + geom_boxplot() + 
    facet_wrap(~party, scales = "free_x")

These are some seriously ugly plots. Here's a riddle: when is a boxplot not a box? Answer: when there is no variability, our box turns into a horizontal line, which you can see is the case for at least 3 candidates. Since I am plotting data per candidate per debate, this means that those candidates only have data for one debate. There are also a lot of candidates- let's winnow the field.

Winnowing the field

Let's look at the candidates who we know are still in as of 2016-02-27. First, let's create a variable that tells us who is in and who is out.

gop_in <- c("RUBIO", "BUSH", "CRUZ", "TRUMP", "CARSON", "KASICH")
dem_in <- c("SANDERS", "CLINTON")
cand_by_debate <- cand_by_debate %>% mutate(still_in = factor(ifelse(speaker %in% 
    c(gop_in, dem_in), 1, 0)))  # 1 is in

Share of words spoken

GOP: in vs. out

Out of curiousity, let's plot the the share of words spoken by each candidate per debate and see if those who are out tended to speak less.

cand_by_debate %>%
  filter(party == "republicans") %>%
  ggplot(aes(x = reorder(speaker, share_tokens), # sorted speakers
             y = share_tokens, # share of words
             group = speaker, # separate out speakers
             colour = still_in)) + # colour by in vs out
  geom_boxplot() +
  geom_point(position = position_dodge(width = .75))

Now we are going to create a new dataset with only those candidates that are still in.

in_cands <- cand_by_debate %>%
  filter(speaker %in% c(gop_in, dem_in)) # just the candidates I listed above
in_cands %>% # sanity check
  select(party, speaker) %>% # select just 2 variables
  distinct(speaker) # unique speakers
# A tibble: 8 × 1
  speaker
   <fctr>
1 CLINTON
2 SANDERS
3    BUSH
4  CARSON
5    CRUZ
6  KASICH
7   RUBIO
8   TRUMP

OK great, looks like that worked.

Let's start building a new plot.

Who speaks the most?

We'll start by plotting share of words spoken by each candidate at each debate.

speaker_share <- ggplot(data = in_cands, # the dataframe
       aes(x = debate_num, y = share_tokens, colour = speaker)) + 
  geom_point(size = 4) + # points are the geom
  scale_x_discrete(name = "") + # don't need to label
  scale_y_continuous(name = "share of words") + # do need to label
  theme_minimal()
speaker_share

Let's plot the parties separately, since obviously Clinton/Sanders do a lot more relative talking. We can take the plot we already made and force ggplot to use a different dataframe by using the %+% operator.

speaker_share %+% subset(in_cands, party == "republicans")

What single word would I change in this code to see the democrats?

Fierce colors

install.packages("devtools")
devtools::install_github("dill/beyonce")
library(beyonce)
beyonce_palette(66)
beyonce_palette(66)[4:11]
fierce_share <- ggplot(data = in_cands, 
                                aes(x = debate_num, 
                                    y = share_tokens, 
                                    colour = speaker)) +
  geom_point(size = 4) + 
  scale_x_discrete(name = "") + # this is pretty clearly the candidates' speakers
  scale_y_continuous(name = "share of words") +
  theme_minimal()
fierce_share

GOP only

fierce_share %+% subset(in_cands, party == "republicans") + scale_colour_manual(values = beyonce_palette(66)[3:8], 
    name = "Candidate")

GOP- add a geom layer = lines

fierce_share %+% subset(in_cands, party == "republicans") + geom_line(aes(group = speaker), 
    lwd = 1) + scale_colour_manual(values = beyonce_palette(66)[2:7], name = "Candidate")

democrats only

fierce_share %+% subset(in_cands, party == "democrats") + geom_line(aes(group = speaker), 
    lwd = 1) + scale_colour_manual(values = beyonce_palette(66)[10:11], name = "Candidate")

or we could FACET by party

fierce_share + geom_line(aes(group = speaker)) + facet_wrap(~party, scales = "free_x") + 
    scale_colour_manual(values = beyonce_palette(66)[c(3:7, 9:11)], name = "Candidate")

instead, facet by candidate to compare debates within each candidate

fierce_facet <- ggplot(data = cand_by_debate, aes(x = debate_num, y = share_tokens, 
    colour = speaker, fill = speaker, group = speaker)) + geom_point(size = 3) + 
    scale_colour_manual(values = beyonce_palette(66)) + scale_fill_manual(values = beyonce_palette(66)) + 
    scale_x_discrete(name = "") + scale_y_continuous(name = "share of words spoken") + 
    theme_bw() + theme(legend.position = "none")

GOP only

fierce_facet %+% subset(in_cands, party == "republicans") + geom_line() + facet_wrap(~speaker)  # connect the dots 

Characteristic words for each candidate

Step 1: load my function for calculating log-likelihood. You'll need to update the file path here

source("/Users/hillali/Documents/Talks/Reed/reed-nlp-and-r-workshop/log-likelihood.R")

Step 2: need to tidy our data

# library(tidyr)
to_ll <- debates %>% select(speaker, word, counts) %>% group_by(word, speaker) %>% 
    summarise(count_sum = sum(counts)) %>% spread(speaker, count_sum) %>% select(2:15, 
    word = word)
to_ll[is.na(to_ll)] <- 0

Step 3: run the log likelihood function! Input: to_ll dataframe

# FYI: this takes a while.
from_ll <- LL(to_ll, threshold = 25, sig.level = 10.83)
Error in if (speaker.word == 0) speaker.word <- 1e-04: argument is of length zero

Step 4: take the from_ll dataframe as output, and rank by speakers, keep only the top 25, and those in our "in" groups for each party

top10_by_speaker <- from_ll %>% group_by(speaker) %>% arrange(desc(LL)) %>% 
    mutate(num_rank = percent_rank(LL), row_rank = row_number(desc(num_rank))) %>% 
    filter(LL > 0, row_rank < 11, speaker %in% c(gop_in, dem_in))
Error in eval(expr, envir, enclos): object 'from_ll' not found

Step 4: plot it!

ggplot(top10_by_speaker, aes(x = speaker, y = row_rank)) + geom_label(aes(label = top10_by_speaker$word, 
    fill = top10_by_speaker$speaker), color = "white", fontface = "bold", size = 5) + 
    theme_bw() + theme(legend.position = 1, plot.title = element_text(size = 18), 
    axis.title.y = element_text(margin = margin(0, 10, 0, 0))) + labs(title = "Most Characteristic Words/Phrases per Person") + 
    xlab("") + ylab("Ranking") + scale_fill_manual(values = beyonce_palette(74)) + 
    annotation_custom(cartman, xmin = 0.5, xmax = 1.5, ymin = 0, ymax = -4) + 
    annotation_custom(stan, xmin = 1.5, xmax = 2.5, ymin = 0, ymax = -4) + annotation_custom(kyle, 
    xmin = 2.5, xmax = 3.5, ymin = 0, ymax = -4) + annotation_custom(kenny, 
    xmin = 3.5, xmax = 4.5, ymin = 0, ymax = -4) + annotation_custom(butters, 
    xmin = 4.5, xmax = 5.5, ymin = 0, ymax = -4) + annotation_custom(randy, 
    xmin = 5.5, xmax = 6.5, ymin = 0, ymax = -4)
Error in ggplot(top10_by_speaker, aes(x = speaker, y = row_rank)): object 'top10_by_speaker' not found

Word length

boxplots of average word length per GOP candidate across debates

cand_counts <- cand_counts %>% select(counts, everything())
expand_counts <- cand_counts[rep(row.names(cand_counts), cand_counts$counts), 
    2:9]
gop_word_length <- ggplot(data = subset(in_cands, party == "republicans"), aes(x = reorder(speaker, 
    avg_length), y = avg_length, colour = speaker)) + geom_boxplot(show.legend = FALSE, 
    outlier.size = 3) + scale_colour_manual(values = beyonce_palette(74)) + 
    scale_x_discrete(name = "") + scale_y_continuous(name = "average word length") + 
    coord_flip()
gop_word_length

let's add a layer of points on top

gop_word_length + geom_point(size = 3, show.legend = FALSE)

hard to see overlapping points though, add jitter and alpha for transparency

gop_word_length + geom_jitter(size = 3, alpha = 0.5, show.legend = FALSE, width = 0.3, 
    height = 0)

note: Trump is missing in debate 5

Syllables per word

of total words spoken, what percent were 2-3-4-5+ syllable words

ggplot(cand_counts, aes(x = syl_count, y = counts)) + geom_bar(stat = "identity") + 
    facet_wrap(~speaker)

Type/token ratio

Aesthetics I need: color/group by speaker; ttr on y-axis; debate number on x-axis The geom I want: lines

# install.packages('ggthemes') library(ggthemes)
ttr_plot <- ggplot(in_cands, aes(x = debate_num, y = ttr, color = speaker, group = speaker)) + 
    geom_line(lwd = 2) + geom_point(size = 3) + scale_colour_hc(name = "candidate") + 
    scale_x_discrete(name = "debate number") + scale_y_continuous(name = "type-token ratio")
ttr_plot

some extra fussing to add text labels and points, and themes, and change to date_time rather than debate number

# need to rotate x-axis and expand x axis
library(ggrepel)
ggplot(in_cands, aes(x = factor(date_time), y = ttr, color = speaker, group = speaker)) + 
    geom_line(lwd = 1.5) + geom_point(size = 2.5) + scale_colour_hc(guide = FALSE) + 
    scale_x_discrete(name = "debate number") + scale_y_continuous(name = "type-token ratio") + 
    geom_text_repel(data = in_cands[in_cands$debate_num == 1, ], aes(label = speaker), 
        size = 4, show.legend = F) + theme_fivethirtyeight()

or add a geom_jitter

ggplot(in_cands, aes(x = speaker, y = ttr, color = speaker, group = speaker)) + 
    geom_jitter(size = 2.5, width = 0.5, height = 0, show.legend = F) + scale_colour_hc() + 
    scale_x_discrete(name = "debate number") + facet_wrap(~party, scales = "free_x") + 
    ggtitle("Per-debate type-token ratios by party and candidate") + theme_fivethirtyeight()

cand_ts <- with(in_cands, xts(ttr, date_time))
dygraph(cand_ts)

Say my name, say my name

let's see how many words each candidate used that referenced key democrats

dems <- c("obama", "hillary", "bernie")
cand_counts %>% group_by(speaker) %>% filter(word %in% dems, party == "republicans")
Source: local data frame [93 x 9]
Groups: speaker [10]

   counts        date       party speaker_type  speaker    word  date_time
    <int>       <chr>       <chr>        <chr>    <chr>   <chr>     <date>
1       2 06 Feb 2016 republicans    candidate    TRUMP hillary 2016-02-06
2      10 06 Feb 2016 republicans    candidate     CRUZ   obama 2016-02-06
3       2 06 Feb 2016 republicans    candidate     CRUZ hillary 2016-02-06
4       1 06 Feb 2016 republicans    candidate CHRISTIE hillary 2016-02-06
5       7 06 Feb 2016 republicans    candidate    RUBIO hillary 2016-02-06
6       3 06 Feb 2016 republicans    candidate CHRISTIE   obama 2016-02-06
7       1 06 Feb 2016 republicans    candidate     BUSH hillary 2016-02-06
8       1 06 Feb 2016 republicans    candidate     CRUZ  bernie 2016-02-06
9       1 06 Feb 2016 republicans    candidate   CARSON hillary 2016-02-06
10      3 06 Feb 2016 republicans    candidate     BUSH   obama 2016-02-06
# ... with 83 more rows, and 2 more variables: word_length <int>,
#   syl_count <int>

but we are missing words like "obamacare"; let's try again

dem_mentions <- cand_counts %>% mutate(obama = ifelse(str_detect(word, "obama"), 
    counts, 0), hillary = ifelse(str_detect(word, "hillary") | str_detect(word, 
    "secretary"), counts, 0), bernie = ifelse(str_detect(word, "sanders"), counts, 
    0)) %>% gather(key = mention_who, value = mention_count, obama:bernie) %>% 
    group_by(party, speaker, date_time, mention_who) %>% summarise(mention_sum = sum(mention_count)) %>% 
    mutate(mention_who = as.factor(mention_who))
dem_mentions
Source: local data frame [210 x 5]
Groups: party, speaker, date_time [70]

       party speaker  date_time mention_who mention_sum
       <chr>   <chr>     <date>      <fctr>       <dbl>
1  democrats  CHAFEE 2015-10-13      bernie           2
2  democrats  CHAFEE 2015-10-13     hillary           0
3  democrats  CHAFEE 2015-10-13       obama           1
4  democrats CLINTON 2015-10-13      bernie           5
5  democrats CLINTON 2015-10-13     hillary           5
6  democrats CLINTON 2015-10-13       obama          13
7  democrats CLINTON 2015-11-14      bernie           8
8  democrats CLINTON 2015-11-14     hillary           3
9  democrats CLINTON 2015-11-14       obama           3
10 democrats CLINTON 2015-12-19      bernie          10
# ... with 200 more rows

note assumptions I made about how GOPs refer by speaker to these folks debate number on x, mentions on y

library(wesanderson)  # for quirky colors
ggplot(data = subset(dem_mentions, party == "republicans"), aes(x = factor(date_time), 
    y = mention_sum, colour = mention_who, group = mention_who)) + geom_line() + 
    geom_point(size = 3) + facet_wrap(~speaker) + scale_colour_manual(values = wes_palette("Darjeeling"), 
    name = "") + scale_x_discrete(name = "") + scale_y_continuous(name = "number of mentions")

let's tidy this up a bit

tidy_dem_mentions <- dem_mentions %>%
  right_join(cand_by_debate) %>% # dplyr merge 
  mutate(mention_prop = mention_sum/tokens) %>% # relative to that speakers' proportion of words
  filter(speaker %in% c(gop_in, dem_in)) 
tidy_dem_mentions
Source: local data frame [153 x 16]
Groups: party, speaker, date_time [51]

       party speaker  date_time mention_who mention_sum debate_num
       <chr>   <chr>     <date>      <fctr>       <dbl>     <fctr>
1  democrats CLINTON 2015-10-13      bernie           5          1
2  democrats CLINTON 2015-10-13     hillary           5          1
3  democrats CLINTON 2015-10-13       obama          13          1
4  democrats CLINTON 2015-11-14      bernie           8          2
5  democrats CLINTON 2015-11-14     hillary           3          2
6  democrats CLINTON 2015-11-14       obama           3          2
7  democrats CLINTON 2015-12-19      bernie          10          3
8  democrats CLINTON 2015-12-19     hillary           2          3
9  democrats CLINTON 2015-12-19       obama           4          3
10 democrats CLINTON 2016-01-17      bernie          10          4
# ... with 143 more rows, and 10 more variables: tot_tokens <int>,
#   tot_speakers <int>, tokens <int>, types <int>, share_tokens <dbl>,
#   ttr <dbl>, avg_length <dbl>, avg_syl <dbl>, still_in <fctr>,
#   mention_prop <dbl>

date on x, mentions relative to total number of words spoken on y

ggplot(data = subset(tidy_dem_mentions, party == "republicans"), aes(x = factor(date_time), 
    y = mention_prop, group = mention_who)) + geom_line(aes(colour = mention_who), 
    lwd = 2.5, alpha = 0.7) + geom_point(colour = "black", size = 1) + facet_wrap(~speaker) + 
    scale_colour_manual(values = wes_palette("Darjeeling"), name = "") + scale_x_discrete(name = "") + 
    scale_y_continuous(name = "mentions as proportion of total number of words spoken") + 
    theme_hc() + theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.ticks = element_blank())

But let's face it, for the GOP the only person worth mentioning is...

biggest_reagan_fan <- cand_counts %>% filter(party == "republicans" & speaker %in% 
    gop_in) %>% group_by(speaker, date_time) %>% mutate(reagan = ifelse(str_detect(word, 
    "reagan"), counts, 0)) %>% summarise(reagan_sum = sum(reagan))
biggest_reagan_fan
Source: local data frame [41 x 3]
Groups: speaker [?]

   speaker  date_time reagan_sum
     <chr>     <date>      <dbl>
1     BUSH 2015-09-16          5
2     BUSH 2015-10-28          0
3     BUSH 2015-11-10          2
4     BUSH 2015-12-15          1
5     BUSH 2016-01-14          0
6     BUSH 2016-01-28          1
7     BUSH 2016-02-06          2
8   CARSON 2015-09-16          2
9   CARSON 2015-10-28          1
10  CARSON 2015-11-10          0
# ... with 31 more rows

who is the biggest reagan fan?

annotate <- ggplot2::annotate  # package tm masked this annotate command
ggplot(data = biggest_reagan_fan, aes(x = speaker, y = reagan_sum)) + geom_jitter(size = 4, 
    colour = "firebrick2", width = 0.6, height = 0.1, alpha = 0.8) + stat_summary(fun.y = sum, 
    fun.ymin = sum, fun.ymax = sum, geom = "crossbar", width = 0.6, colour = "dodgerblue") + 
    facet_wrap(~date_time, switch = "x", ncol = 6, scales = "free_x") + ggtitle("Which GOP candidate is the biggest Reagan fanboy?") + 
    theme_fivethirtyeight()

mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
tot_reagan <- biggest_reagan_fan %>% group_by(speaker) %>% summarise(tot_mentions = sum(reagan_sum))
gg <- ggplot(biggest_reagan_fan)
gg <- gg + geom_bar(aes(x = date_time, y = reagan_sum, fill = speaker), stat = "identity", 
    alpha = 0.8)
gg <- gg + scale_fill_manual(values = mygop)
gg <- gg + scale_y_continuous(expand = c(0, 0))
gg <- gg + facet_wrap(~speaker, switch = "x", ncol = 6, scales = "free_x")
gg <- gg + labs(x = NULL, y = NULL, title = "Reagan mentions")
gg <- gg + theme_fivethirtyeight()
gg <- gg + theme(panel.margin.x = unit(0.25, "cm"))
gg <- gg + theme(legend.position = "none")
gg <- gg + theme(panel.grid.major.x = element_blank())
gg